TEMPRO: nanobody melting temperature estimation model using protein embeddings

Single-domain antibodies (sdAbs), or nanobodies, have received widespread attention due to their small size (~15 kDa) and diverse applications in bio-derived therapeutics. As many modern biotechnology breakthroughs are applied to antibody engineering and design, nanobody thermostability, measured as melting temperature (Tm), is crucial for their successful utilization. In this study, we present TEMPRO, a predictive modeling approach for estimating the Tm of nanobodies using computational methods. Our methodology integrates various nanobody biophysical features, including Evolutionary Scale Modeling (ESM) embeddings, NetSurfP3 structural predictions, pLDDT scores per sdAb region from AlphaFold2, and each sequence’s physicochemical characteristics. This approach is validated with our combined dataset containing 567 unique sequences with corresponding experimental Tm values from manually curated internal data and a recently published nanobody database, NbThermo. Our results indicate the efficacy of protein embeddings in reliably predicting the Tm of sdAbs, with a mean absolute error (MAE) of 4.03 °C and a root mean squared error (RMSE) of 5.66 °C, thus offering a valuable tool for the optimization of nanobodies for various biomedical and therapeutic applications. Moreover, we have validated the models’ performance using experimentally determined Tms from nanobodies not found in NbThermo. This predictive model not only enhances nanobody thermostability prediction but also offers a useful perspective on using embeddings as a tool for broadening the applicability of downstream protein analyses.

consists of four chains: two identical light chains and two identical heavy chains, which correspond to the well-known Y shape of an antibody (Fig. 1A). The single variable domain on a heavy chain (sdAb) is the fragment of an antibody consisting of a single monomeric variable domain, separate from its light chain. VHHs contain 4 framework regions (FRs) that form the core structure of the immunoglobulin domain and 3 complementarity-determining regions (CDRs) that are involved in antigen binding (Fig. 1B,C). FRs are conserved regions of the antibody which allow the antigen-binding, hypervariable CDR regions to be stable 8; these are summarized in Table 1.

Table 1. Roles of framework (FR) and complementarity-determining regions (CDR).

FR1 (Framework Region 1; N-terminus): Possible target of efficient mutagenesis for generating a variety of affinity-matured scFv mutants 9

CDR1 (Complementarity-determining Region 1; starts 4 residues after the first cysteine; length 6-15 residues): The extended hypervariable CDR1 region (residues 27-30) is used together with CDR3 to increase the surface area interacting with the antigen; varying these amino acids during synthesis has therefore been proposed to increase potential antigen binding 10

FR2 (Framework Region 2): Dromedary VHHs have an extended CDR3 that is often stabilized by an additional disulfide bond with a cysteine in CDR1 or FR2 11

CDR2 (Complementarity-determining Region 2; starts 10 to 20 residues after the end of CDR1; length 8-15 residues): Can provide additional antigen interaction force with CDR3 12, and an unusual extra glycine residue was recently shown to provide a neutralizing spot for toxin binding 13

FR3 (Framework Region 3): Introducing a non-canonical disulfide bond into the hydrophobic core of llama VHHs between FR2 and FR3 was shown to increase thermal stability at neutral pH and resistance to proteolytic degradation 14

CDR3 (Complementarity-determining Region 3; starts 30 to 50 residues after the end of CDR2; length 3-25 residues): The most variable region, considered the center of antigen recognition 15

Nanobody thermostability
Nanobody thermostability, measured as melting temperature (Tm), is crucial for their successful utilization across their varied applications. In storage and transport, thermostable antibodies can be handled without refrigeration, which is crucial in remote or low-resource areas. In the human body, thermostable nanobodies have longer half-lives and greater efficacy due to lower degradation. In biocatalysis or bioremediation, thermostable nanobodies can remain effective at higher temperatures, if needed for the application 16,17. Because of these benefits, several approaches to estimating the Tms of proteins have been investigated, such as molecular dynamics simulations 18, deep learning 19,20, multiple machine learning ensembles 21, and even early work using the correlative composition of dipeptides 22. However, some of these studies generalize broadly to proteins overall, while others only classify the predicted Tms on a discrete scale (e.g., > 65 °C, between 55 and 65 °C, and < 55 °C) or as thermophilic versus non-thermophilic, which makes them unreliable for precise Tm prediction specific to nanobodies.
Recently, embeddings predicted by pre-trained models have been used as features for exploring biophysical properties of proteins 23 and even DNA sequences 24. State-of-the-art models continue to demonstrate that protein representations aid in prediction, classification, and recognition tasks. Here we introduce TEMPRO: nanobody melting Temperature Estimation Model using PROtein embeddings. This tool aims to predict the melting temperature of a protein from its sequence alone and is specifically tailored for nanobodies. Along with evaluating the feasibility of using protein embeddings as a predictive feature, we have benchmarked the best-performing model against other quantifiable biophysical properties of nanobodies, several other machine learning techniques, and some of the latest prediction models from the literature.

Nanobody datasets
A recently published dataset from the NbThermo database 25 includes 548 nanobody sequences with their corresponding organism source, experimentally measured melting temperatures (Tms in °C), target antigen, framework (FR) and complementarity-determining regions (CDR), and digital object identifier source information. We combined this with our in-house, manually curated nanobody dataset consisting of 166 sequences from multiple published studies (all raw and processed data can be found on our GitHub page, listed in the Data Availability Statement). Of the 548 nanobodies from NbThermo, 9 sequences had missing Tms and were removed during data cleaning. Subsequently, we combined the cleaned NbThermo nanobodies with our internal nanobody dataset and collapsed duplicate sequences into single observations. The processed data comprise 567 unique sequences with their corresponding Tms, which were then split in an 80:20 ratio into training and validation sets. All sequences were restricted to the 20 natural amino acids, and sequences with non-conventional residues were excluded.
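The cleaning steps described above (dropping missing Tms, excluding non-natural residues, collapsing duplicates, and an 80:20 split) can be sketched in plain Python. This is an illustrative sketch only and not the authors' pipeline: the function name `clean_and_split`, the random seed, and the choice to keep the first Tm seen for a duplicated sequence are assumptions, since the text does not say how conflicting duplicate measurements were reconciled.

```python
import random

NATURAL_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 natural amino acids


def clean_and_split(records, train_fraction=0.8, seed=0):
    """Clean (sequence, tm) pairs and split them into training and validation sets.

    records: iterable of (sequence, tm) pairs; tm is None when missing.
    Assumption: when a sequence occurs more than once, the first Tm is kept.
    """
    unique = {}
    for seq, tm in records:
        seq = seq.strip().upper()
        if tm is None:
            continue  # drop entries with a missing Tm
        if not seq or not set(seq) <= NATURAL_AA:
            continue  # drop sequences with non-conventional residues
        unique.setdefault(seq, float(tm))  # collapse duplicate sequences
    items = sorted(unique.items())  # fixed order so the shuffle is reproducible
    random.Random(seed).shuffle(items)
    cut = round(train_fraction * len(items))
    return items[:cut], items[cut:]
```

Sorting before shuffling makes the split depend only on the seed and the cleaned data, not on the order in which the two source datasets were concatenated.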

Physicochemical characteristics
Global Sequence Signature (GSS) analyses by Kunz et al. 26 demonstrated that residue compositions, as reflected in physicochemical properties including aliphatic index, charge, and others, contribute to nanobody stability, and thus we have included several of the same features in the model. Most of the biophysical properties quantified for the nanobodies were generated with the Peptides R package 27, which computes a set of physicochemical characteristics from the amino acid sequence. Other properties were computed through in-house calculations. These characteristics are summarized in Table 2.
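To make a few of these sequence-derived features concrete, the sketch below computes the GRAVY index, the aliphatic index, and a 280 nm extinction coefficient in Python. It is not the Peptides R package or the authors' in-house code. It re-implements widely published formulas (the Kyte-Doolittle scale, Ikai's aliphatic index, and the Trp/Tyr/cystine extinction formula), and the default assumption that every cysteine is paired in a disulfide bond is ours.

```python
# Kyte-Doolittle hydropathy values for the 20 natural amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """GRand AVerage of hydropathY: mean Kyte-Doolittle value over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index from the mole percent of Ala, Val, Ile, and Leu."""
    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A") + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def extinction_coefficient(seq, n_cystines=None):
    """Molar extinction coefficient at 280 nm (M^-1 cm^-1).

    Assumption: if n_cystines is not given, all cysteines are treated as
    paired in disulfide bonds (count("C") // 2 cystines).
    """
    if n_cystines is None:
        n_cystines = seq.count("C") // 2
    return 5500 * seq.count("W") + 1490 * seq.count("Y") + 125 * n_cystines
```

Positive GRAVY values indicate a more hydrophobic sequence, matching the sign convention described in Table 2.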

Multi-agent stability prediction upon point mutations (MAESTRO)
MAESTRO is a structure-based method for predicting changes in protein stability. It implements a multi-agent machine learning (ML) system and provides predicted free energy change (ΔΔG) values along with a corresponding prediction confidence estimate 50. This confidence estimate was extracted for all 567 nanobodies and used as a predictive feature.

AlphaFold2 pLDDT scores with antibody numbering
Multiple protein modeling algorithms exist in the literature, but tailoring a prediction tool specifically to nanobodies does not always yield better model predictions. Considerable variability also exists among the results of different prediction tools such as NanoNet 52, IgFold 53, ESMFold 54, OmegaFold 55, and trRosetta 56. However, it was previously shown that general protein modeling programs, such as AlphaFold2 57, have been exposed to a wide and diverse set of protein structures, which yields better predictions, especially for CDRs 58. Therefore, given its ease of use and well-established algorithm, all structural predictions of the 567 nanobodies used in this study were made with AlphaFold2. The AlphaFold2 authors observed high side-chain accuracy when the protein backbone prediction is accurate, and this is denoted by the predicted local-distance difference test (pLDDT) confidence measure. These confidence scores were extracted from each output Protein Data Bank (PDB) file.
The antibody numbering used in the NbThermo database is the Aho numbering scheme 59. For consistency, we applied the same numbering scheme, using ANARCI 60, to the other nanobodies with missing antibody annotations. Ultimately, we extracted the pLDDT scores for each nanobody region sequence (i.e., FRs 1-4 and CDRs 1-3).
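As an illustration of the per-region pLDDT extraction, the sketch below reads the pLDDT values that AlphaFold2 writes into the B-factor column of its PDB output and averages them over residue ranges. It is a minimal sketch under stated assumptions: the function names are hypothetical, one value per residue is taken from the Cα atom, and the region boundaries are supplied by the caller as residue-number ranges in the PDB file's own numbering. Mapping Aho positions from ANARCI onto those residue numbers is not shown.

```python
def ca_plddt(pdb_text):
    """Return {residue_number: pLDDT} using the B-factor of each CA atom.

    Relies on the fixed-width PDB ATOM record layout: atom name in
    columns 13-16, residue number in 23-26, B-factor in 61-66.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def region_means(scores, regions):
    """Average pLDDT per region.

    regions: {name: (first_residue, last_residue)}, inclusive, in the same
    numbering as the PDB file (hypothetical boundaries chosen by the caller).
    Regions with no residues present map to None.
    """
    means = {}
    for name, (start, end) in regions.items():
        values = [scores[i] for i in range(start, end + 1) if i in scores]
        means[name] = sum(values) / len(values) if values else None
    return means
```

Each nanobody then contributes seven numbers (FR1-4 and CDR1-3), which can be used directly as feature columns.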

Machine learning (ML) ensembles
This study used multiple machine learning ensembles and tools alongside the Scikit-learn Python module 62, namely XGBoost (XGB) 63, Random Forests (RF) 64, Support Vector Machines (SVM) 65, and Least Absolute Shrinkage and Selection Operator (LASSO) 66. Each ML model was validated during training using repeated k-fold cross-validation with five splits and three repetitions.
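Repeated k-fold cross-validation with five splits and three repetitions yields 15 train/validation partitions per model. The dependency-free sketch below shows only the index bookkeeping involved. In practice scikit-learn's `RepeatedKFold(n_splits=5, n_repeats=3)` provides the same behavior, and the function name and seed used here are illustrative assumptions.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=3, seed=0):
    """Yield (train_indices, validation_indices) for repeated k-fold CV.

    Each repetition reshuffles the sample indices and partitions them into
    n_splits folds; every fold serves once as the validation set.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            validation = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, validation
```

Averaging a metric over all 15 partitions gives a more stable estimate than a single 5-fold run, which matters for a dataset of only a few hundred sequences.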

Deep neural networks (DNNs)
Backpropagation in artificial neural networks (ANNs) was introduced in the early work of David Rumelhart in 1986 to improve the memory of a network 67 and is now clearly represented as a class of learning techniques by LeCun et al. 68. Specifically, this study used deep neural networks 69 with one input layer, three hidden layers, and one output layer, built with the Keras Sequential API 70, to train the nanobody thermostability predictors.

Metadata: Description

Length: Number of amino acids in a sequence. The average sequence length of all 567 nanobodies is ~124 amino acids

Aliphatic index: Index represented by the relative volume occupied by the aliphatic side chains of alanine, valine, isoleucine, and leucine; regarded as a positive factor for the increase of thermostability of globular proteins 28

Instability index: The predicted in vivo stability of a protein from its primary sequence 29. An instability index of < 40 is considered stable, while > 40 is considered unstable

Hydrophobicity (GRAVY) index: GRand AVerage of hydropathY (GRAVY) index based on the Kyte-Doolittle scale 30, where positive values indicate a more hydrophobic character and negative values a more hydrophilic one. The hydrophilic region in a VHH accounts for the superior stability and solubility 31

Flexibility index: Described as the symmetric/asymmetric distribution of amino acid residues in the protein 32. Protein flexibility links structure and function in the context of adaptation to temperature, and is qualitatively defined as a protein's capability to alter its conformation in response to a change in temperature 33. Flexibility indices also correlate with the tendency of the side chain to be buried or exposed 34

Polarizability: One of the 26 physicochemical descriptor variables discussed by Sandberg et al. 35 that describes an amino acid's steric properties. The protein's spatial arrangement dictates its propensity to become polarized temporarily, where higher polarizability favors more interactions. It was previously demonstrated that several cell types can contribute to antibody polarizability, which results in differences in immune response and binding [36][37][38]

Charge: Net charge of a protein sequence based on the Henderson-Hasselbalch equation 39 using the Lehninger scale 40. The theoretical net charge of the complementarity-determining regions (CDRs) is a strong predictor of antibody specificity, or the degree to which an immune response discriminates between antigenic variants 41,42

Tryptophan count: Mainly used as a factor for computing the extinction coefficient. However, it is included as a feature due to the intrinsic fluorescence of proteins 43 under elevated temperature conditions (such as when measuring Tm using circular dichroism), which have distinct absorption wavelengths 44

Cysteine count: Cysteine residues serve essential roles in protein structure and function by conferring stability through disulfide bond formation, which maintains proper maturation and localization through protein-protein intermolecular interactions 45. This is also important in antibody cysteine-based conjugation and in determining framework regions through numbering

Extinction coefficient: A protein's extinction coefficient indicates how much light a protein absorbs at a certain wavelength, and it is involved in protein purification 46

Absorbance: Commonly, the optical absorption of proteins is measured at 280 nm 47. According to the Beer-Lambert law, the concentration of a protein is directly proportional to its absorbance at a defined wavelength and a constant path length. This measure is also involved in the structural characterization of proteins and some antibodies 48

Molecular weight: Single-domain antibodies are composed of a variable domain of heavy chain fragments that are around 11-15 kDa 49

www.nature.com/scientificreports/

Software
The overall model training and analyses were performed using Jupyter Notebooks (Python 3.9.15) under the Anaconda Distribution. Data preprocessing was conducted using R/RStudio. All figures were plotted using Python's Seaborn 71 module, and three-dimensional protein structural representations were rendered with PyMOL software.

Correlation calculations
Pairwise Pearson correlations were calculated between all structural and sequence properties of nanobodies.
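For reference, the pairwise Pearson coefficient can be written out explicitly. The sketch below is a plain-Python illustration with hypothetical function names, not the authors' analysis code. It returns NaN for a constant feature, where the coefficient is undefined.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for a constant feature
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(features):
    """Pairwise correlations for {feature_name: list_of_values}."""
    names = list(features)
    return {(a, b): pearson(features[a], features[b])
            for a in names for b in names}
```

The resulting dictionary is symmetric in its key pairs and has a value of 1 on the diagonal for any non-constant feature.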

Exploratory analyses uncover relationships between nanobody properties
Preliminary exploratory analyses were conducted using cross-correlations between features to identify relationships between the nanobodies' biophysical properties (because protein embeddings constitute large, unlabeled metadata, they were excluded from the cross-correlations in Fig. 2). NetSurfP3 predictions mostly show positive correlations, but it is interesting to note that the q3_C and q8_C coil structure states are highly correlated with the phi (N-Cα) torsion angles, disorder, relative solvent accessibility (rsa), and absolute surface area (asa) values (approximately ρ > 0.5). Kurgan et al. 72 reported that secondary structure predicted using 1-dimensional descriptors of protein structure is useful in the prediction of disorder, flexible regions, fold recognition, and relative solvent accessibility. It was also previously observed that the backbone torsion angles and secondary structures of a protein are highly correlated 73. In this context, the predicted coil states contribute to higher correlations of these structural properties as compared with alpha helices and beta strands. The intricacies of the FR/CDR regions of nanobodies, such as the conserved frameworks (FR1-4) and variable CDRs, are instrumental in their sequence identity and arrangement, especially in the context of their thermostability (please refer to 74 for a review); thus, we see a slight cross-correlation with AlphaFold2's pLDDT scores. On another note, the predicted physicochemical characteristics (i.e., maestro_score, aliphatic index, and others) show low correlation with any of the features, with the exception of (1) molecular weight, which is highly correlated primarily with sequence length (ρ = 0.94), q3_C (ρ = 0.62), q8_C (ρ = 0.71), phi (ρ = 0.49), rsa (ρ = 0.42), and asa (ρ = 0.45); and (2) the number of tryptophans in the sequence, which is highly correlated with extinction (ρ = 0.87) and absorbance (ρ = 0.88) as compared with tyrosine and cysteine. The predictive capabilities of these features are discussed further below.
It may seem counterintuitive that the physicochemical properties of nanobodies are barely related to each other. For example, Natesan et al. 75 reported that, in estimations of the solvent-accessible surface for individual cysteines in an IgG1 antibody through molecular dynamics simulations, the interchain and hinge cysteines have > 1000 higher solvent accessibility as compared to intrachain cysteines. They suggest that a cysteine's accessibility to the surrounding solvent is one of the primary determinants of its disulfide bond stability. By this notion, although unrelated to estimating an antibody's melting temperature, the number of cysteines or any quantity that involves this amino acid should be at least correlated with some other biophysical property. However, Fig. 2 clearly shows that the number of cysteines is minimally correlated across all the features. Consequently, since the nanobody regions or antibody numbering are mainly governed by the positions of cysteine residues, we have also explored the pLDDT scores for each region and their relationship to the corresponding Tms.

Framework and complementarity-determining regions show extremely low correlation to nanobody thermostability using AlphaFold2 pLDDT scores
In light of AlphaFold2's breakthrough in protein structure prediction, recent studies have used AlphaFold2's pLDDT scores to test discriminative capabilities for classification and even to analyze antibody-antigen interface residue scores 76,77. For consistency, the structures of all 567 nanobodies were predicted using AlphaFold2, and each processed nanobody has an output PDB file (included in the GitHub repository). Regarding the importance of cysteine residue positions for the Tms of nanobodies, Saerens et al. 78 investigated a VHH that naturally had an extra pair of cysteines at positions Cys54 and Cys78, which led to a disulfide bond linking FR2 and FR3. Moreover, several studies have demonstrated that introducing an additional disulfide bond into the hydrophobic core of llama VHHs between FR2 and FR3 resulted not only in increased thermal stability at neutral pH but also in better resistance to proteolytic degradation [79][80][81]. Here, regional pLDDT scores of the 567 nanobodies display varying relationships with Tms, with FR3 exhibiting the highest correlation (ρ = 0.24), which suggests the importance of this region in thermostability (Fig. 3). Although there is some correlation between Tm and AlphaFold2 pLDDT scores, the overall analyses exhibit extremely low correlations for all nanobody regions. The results suggest that regional AlphaFold2 pLDDT scores do not carry enough information about nanobody thermostability to be of significant value for Tm prediction.

Protein embeddings exhibit predictive qualities for nanobody thermostability
Protein language models are of growing interest in the current literature in the contexts of protein stability changes, protein sequencing/scaffold filling, and other biophysical property predictions 54,82,83. Recently, protein representations or embeddings have been used for protein engineering, such as in predicting solubility, thermophilicity, and other property prediction or classification tasks 19,23,24. As the nanobodies' biophysical properties have displayed varying degrees of correlation, as depicted in Figs. 2 and 3, we have used each of these computed predictive features and compared them with the protein embeddings from each of the ESM models in estimating nanobody thermostability (Fig. 4).
Due to the large number of predictors, we initially opted to use a deep neural network (DNN) model built with the Keras Sequential API with 1 input layer, 3 hidden layers, and 1 output layer, to accommodate dataset complexities. Additionally, we compared different lengths of model training and determined that the best-performing training length was 1500 epochs (model training diagnostics can also be found in the GitHub repository). Thus, for consistency, we applied the same parameters to all predictive features in the dataset.
Preliminary results for DNN training reveal that protein embeddings exhibit better predictive capabilities than AlphaFold2 pLDDT scores (Fig. 4A), the nanobodies' physicochemical characteristics (Fig. 4B), and structural predictions from NetSurfP3 (Fig. 4C). Each of the pre-trained ESM models (Fig. 4D-F) gives better predictions overall, with the ESM_15B embeddings as the best-performing predictive feature (Fig. 4F), with a mean absolute error (mae) of 4.03 °C, root mean-squared error (rmse) of 5.66 °C, and coefficient of determination (r²) of 71.4% on the test dataset. The second-order polynomial regression lines (in red) and purple scatterplots of Tms also indicate a better fit as compared with the other predictors.
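The three reported metrics can be stated precisely with a short sketch. The function name is hypothetical, and the R² shown is the usual 1 - SS_res/SS_tot definition. The text does not state which R² convention was used, so treat that choice as an assumption.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (mae, rmse, r2) for measured and predicted Tm values in °C."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2
```

Because the RMSE squares each error before averaging, it is always at least as large as the MAE and is more sensitive to a few badly predicted nanobodies, which is consistent with the reported RMSE (5.66 °C) exceeding the MAE (4.03 °C).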

DNN performs better than other machine learning techniques using protein embeddings
Following the preliminary results of using a DNN on various nanobody features, we compared these same features using other appropriate machine learning models with a proven track record of high performance and validity, such as gradient boosting, decision trees, and multiple ensemble algorithms. Similar to the previous analyses, the nanobodies' ESM embeddings outperformed all other features in estimating Tms (Fig. 5).
However, it is important to note that with a smaller number of predictors (e.g., NetSurfP3 has 23 columns/features vs ESM embeddings ranging from 1280 to 5120 columns), the XGBoost and Random Forest algorithms outperform the DNN (Figs. 5A-C, 6B), while SVM and LASSO perform the worst. This investigation indicates that as the number of features grows larger and more complex, DNNs can provide a better Tm approximation than standard ML techniques (Fig. 6).

Predictive performance of nanobody feature combinations
Using the best-performing model (i.e., DNN ESM_15B at 1500 epochs), we tested different combinations of predictors to determine whether combining features can improve Tm estimation. We combined AlphaFold2 pLDDT scores, NetSurfP3 predictions, and physicochemical characteristics of nanobodies along with ESM_15B embeddings. The mae improved by ~29%, the rmse by ~22%, and the coefficient of determination by 32%, all of which were validated using the test set (Fig. 7A). However, this does not hold in all cases: when pairing the ESM_15B embeddings with each of the selected features, no improvements were observed. In fact, combining the two best predictive features (i.e., ESM_15B and ESM_3B) resulted in a higher mean absolute error (4.03 °C and 4.24 °C, respectively; Fig. 7B), although the test mae values are relatively close.

TEMPRO predicts nanobody thermostability accurately
To compare our model predictions, we explored other known melting temperature predictors in the literature that can process a relatively large number of protein sequences (Ref. 19 and https://github.com/GavrilenkoA/Tm_prediction) and that do not classify predictions into a set of ranges or thresholds (Refs. 22,23,84 and https://github.com/zhibinlv/BertThermo). Since our model is specifically tailored to nanobodies, it outperforms the other Tm predictors (Fig. 8). For consistency, we used our saved DNN model to predict the Tms of all 567 nanobody sequences instead of only the test dataset. This resulted in an mae of 1.76 °C and an rmse of 2.91 °C. Note that these 567 sequences include the training set, so these errors are expected to be lower than the test-set errors reported above.

Validation with another database
Furthermore, since NbThermo includes most of the nanobodies in the current literature, additional data are scarce. To compare our pre-trained models with existing experimental data, several antibodies and their variants with experimentally determined Tms were obtained from another database (https://research.naturalantibody.com/) 85 that is separate from the dataset used for training and validation. The Tm predictions for these sequences using our three different pre-trained models are summarized in Table 4. For robustness, the selected sequences also displayed unusually low or high Tms, ranging from ~46 to 88 °C (please see the reference column in Table 4). These experimental Tm values are also found in the GitHub repository (listed here upon acceptance). When comparing the ESM models' results to the reference values using linear regression, ESM_15B is the best performer with an R² of 0.67, followed by ESM_3B with 0.58 and ESM_650M with 0.25. Except for the predictions of the Tm of 4W68, where all models fared poorly, predictions made by ESM_15B and, to a lesser extent, ESM_3B were relatively close to the reference Tms (Fig. 9). In contrast, the ProTDet and DeepStabP model predictions have no correlation with the reference Tms.

Discussion
We have demonstrated the applicability of protein embeddings as a predictive feature for estimating nanobodies' thermostability with relatively high accuracy. This tool can help reduce the tedium and expense of manually deriving or synthesizing nanobodies to achieve or determine a desired property. VHHs can be remodeled, engineered, and reassembled to generate new antibody-like molecules with desired properties and specificities, consisting of multiple units of the same or of different function 89. As an example, Tomimoto et al. previously demonstrated that a comprehensive mutagenesis approach exhibits low efficiency for identifying nanobody mutants with improved Tm, and instead proposed a single mutation in VHH that can increase the Tm by more than 5 °C while nearly maintaining affinity 90. Overall, appropriately engineered biomolecules should exhibit pharmacokinetic properties that make them desirable for both production and application, and an initial estimate of their melting temperature is especially valuable.

Comparisons of predictive features
Researchers have also previously demonstrated that the replacement of cysteine residues by appropriate amino acids improves the heat resistance of a native VHH (please see 14 for a comprehensive review). Multiple studies also showed the effects of having up to three disulfide bonds in one nanobody, where the three bonds consist of the bond found in the wild-type nanobody, an analogous bond, and a novel bond connecting CDR1 and FR3 78,91, resulting in stability improvements of up to 19 °C. It was also previously reported that moderate modifications to CDR1 may improve nanobody stability while preserving binding capacity, which suggests that CDR1 can also be a region of interest for thermostability improvements 92. With this premise, we used predictions from AlphaFold2 to explore its capabilities as a predictive feature for each of the nanobody regions. However, as the low correlations in Fig. 3 suggest, other nanobody regional measures or metrics should be considered. Although a considerable number of crystal structures of nanobodies have been registered in the Protein Data Bank, only a few structures are accompanied by their Tms 18. Additionally, new insights have been gained by characterizing the complexities of antibodies, from early studies of full-length antibodies, such as those from sharks, camelids, and humans, to these single functional fragment antibodies. Basic biophysical properties such as amino acid lengths, hydrophobicity indices, cysteine counts, and instability indices of the nanobody sequences were preliminarily chosen as inspection tools for the bio-relevance of the predictors. The high stability of nanobodies indicated by the hydrophobicity, solvent accessibilities, and instability indices is a sought-after biophysical property that supports their potential development as therapeutics. Counterintuitively, these features fail to increase the predictive power of machine learning techniques in estimating the nanobodies' thermostability. In comparison, protein embeddings offer an alternative for better Tm estimation, as demonstrated in this study.

Protein embeddings, hardware limitations, and deep learning predictions
Protein embeddings are numerical representations that encode protein properties and functions and are suited for multiple tasks [93][94][95]. Each model has its own strengths and weaknesses (such as speed, memory footprint, and computational cost). It is generally recognized that there is no "one size fits all" model when it comes to a specific task, even with suggested hardware requirements. Aside from the other nanobody characteristics previously discussed, we generated the 650M (650 million parameters, 2.4 GB), 3B (3 billion, 5.3 GB), and 15B (15 billion, 28.2 GB) ESM embeddings for the 567 sequences using an NVIDIA H100 GPU (please refer to https://github.com/facebookresearch/esm for the number of layers, embedding dimensions, and other information). Generating the 567 embeddings with the largest pre-trained model (ESM 15B) required approximately 5 h of runtime. In contrast, both training the models highlighted in this work and predicting a nanobody's melting temperature complete in under a minute using an NVIDIA 2080 Ti GPU. These runtimes will vary depending on the hardware used.
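A protein language model emits one vector per residue, while the predictors here take a fixed number of feature columns per nanobody (1280 to 5120, depending on the ESM model). One common way to bridge the two is to average the per-residue vectors over the sequence. The sketch below illustrates that reduction. The text does not state which pooling was used, so mean pooling is an assumption, and the function name is hypothetical.

```python
def mean_pool(per_residue):
    """Average per-residue embedding vectors into one sequence-level vector.

    per_residue: list of equal-length numeric vectors, one per residue.
    The output length equals the embedding dimension, independent of the
    sequence length, so every nanobody yields the same number of columns.
    """
    if not per_residue:
        raise ValueError("at least one residue vector is required")
    dim = len(per_residue[0])
    if any(len(vec) != dim for vec in per_residue):
        raise ValueError("all residue vectors must have the same dimension")
    n = len(per_residue)
    return [sum(vec[d] for vec in per_residue) / n for d in range(dim)]
```

With this reduction, a 124-residue and a 130-residue nanobody both map to a vector of the same width, which is what allows a single dense network to consume all 567 sequences.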
Beyond their predictive value, embeddings could also make it possible to highlight which amino acids are responsible for Tm changes, so that capturing a nanobody's Tm change upon point mutation can realistically be achieved. Mapping the pre-trained model's embeddings to specific residues could indicate which residues, when altered, are most likely to yield a more thermostable nanobody. We therefore recommend this direction for future research, as it could provide a mechanism for guiding small alterations that enhance nanobody thermostability, especially if paired with information that would allow avoidance of point mutations that negatively impact affinity and other desired attributes.
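One simple way such a point-mutation analysis could be organized is an in silico scan: enumerate every single-residue substitution, score each variant with a Tm predictor, and rank the predicted changes. The sketch below is purely illustrative of that idea and is not part of TEMPRO. The `predict_tm` argument is a hypothetical callable standing in for any sequence-to-Tm model, and the scan ignores effects on affinity.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_point_mutants(seq):
    """Yield (label, mutant_sequence) for every single substitution.

    Labels use the usual wild-type/position/mutant form, e.g. "A1W",
    with 1-based positions.
    """
    for pos, wild_type in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != wild_type:
                yield f"{wild_type}{pos + 1}{aa}", seq[:pos] + aa + seq[pos + 1:]


def rank_mutations(seq, predict_tm, top=5):
    """Return the `top` substitutions with the largest predicted Tm gain.

    predict_tm: hypothetical callable mapping a sequence to a Tm in °C.
    Returns a list of (delta_tm, label) pairs, best first.
    """
    baseline = predict_tm(seq)
    scored = [(predict_tm(mutant) - baseline, label)
              for label, mutant in single_point_mutants(seq)]
    scored.sort(reverse=True)
    return scored[:top]
```

A sequence of length L produces 19 × L variants (about 2,400 for a typical 124-residue nanobody), so the cost of such a scan is dominated by generating embeddings for each variant rather than by the Tm prediction itself.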
As modern biotechnology makes breakthroughs in protein sequence generation through generative deep learning methods, the prominent interest in antibody production has its merits. Multiple studies have demonstrated the feasibility of synthetically deriving novel proteins through the use of deep-learning techniques [96][97][98][99]. Automated systems can process small or large sequence data through artificial neural networks that rely on many layers of nonlinear processing units for learning data representations. This architecture has proven useful in molecular prediction algorithms for protein synthesis, such as antimicrobial design, antibiotic discovery, and improved thermostability 83,97. While such approaches continue to develop, these predictive techniques for determining biophysical properties of proteins from their sequence alone will continue to contribute to the design and evolution of synthetic biomolecules.

Figure 1. Nanobody structure and sequence characteristics. (A) Representation of a camelid heavy-chain antibody with the VHH encircled in red. (B) Representative nanobody from the PDB (ID: 1ZVH), the crystal structure of a VHH domain from Camelus dromedarius; CDR regions are highlighted in green (CDR1), blue (CDR2), and red (CDR3). (C) Sequence arrangement of a nanobody. CDR sequences are highlighted with the same color scheme.

Figure 2. Correlations between biophysical properties of nanobodies. Predicted values from NetSurfP3 display varying degrees of correlation, with coil states (q3_C and q8_C) exhibiting highly positive relationships with other structural features. AlphaFold2 pLDDT scores per region exhibit relationships only with each other and minimal relationships with other properties. Physicochemical characteristics from the R Peptides package show relatively low correlation with other predicted biophysical properties of nanobodies, with the exception of molecular weight and the number of tryptophans in a nanobody sequence.

Figure 3. AlphaFold2 pLDDT scores per nanobody region vs Tms (°C). Framework regions' pLDDT scores are generally more correlated with Tms. Linear regression fit lines (dashed lines with gray shading) were added with the Seaborn module's regplot function. The 1ZVH nanobody was rendered using PyMOL software.

Figure 5. Nanobody feature performance by multiple machine learning models. Similar to previous analyses, various nanobody features were tested with several ML techniques to investigate predictive performance. (A-C) XGBoost and Random Forests perform better than DNN, SVM, and LASSO with a lower number of predictors, while (D-F) DNN performs better with ESM embeddings, which contain thousands of columns of nanobody representations.

Figure 6. DNN vs ML performance. All performance metrics are validated using test data. (A) Training for 1500 epochs gave the best DNN performance, with ESM_15B as the best predictor. (B) Random Forest outperforms all other ML models presented, closely followed by XGBoost.

Figure 7. Predictive performance of nanobody feature combinations. (A) Combined sets of predictors increased in performance upon inclusion of ESM_15B embeddings. (B) Each combination of the best-performing feature (ESM_15B) with other predictors. Solely using ESM_15B embeddings as the predictive feature outperforms all cases.

Figure 8. Melting temperature prediction comparisons with other online tools. TEMPRO (orange boxplot) outperforms other prediction tools (i.e., ProTDet and DeepStabP) and closely resembles the distribution of the actual nanobody Tms (blue).

Figure 9. Experimental validation of nanobody Tms using pre-trained models from TEMPRO. The pre-trained ESM_15B model gives the best predictions of the Tms of nanobody sequences not included in the NbThermo dataset, with the highest coefficient of determination (R² = 0.67). Actual Tm values are found in Table 4.

Table 4. Experimental validation of Tm prediction using pre-trained models (all units in °C).